Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)

Probability and Likelihood

provide a convenient way to normalize (render dimensionless) a random variable,

namely

$$\mathbf{X}^{*} = \frac{\mathbf{X} - \mu_X}{\sigma_X}. \tag{9.35}$$
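As a small illustration of (9.35), the following sketch standardizes a sample using only the Python standard library. The function name `standardize` and the data are hypothetical, and the population standard deviation (division by $n$) is assumed for $\sigma_X$.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return the dimensionless values (x - mu) / sigma, as in eq. (9.35)."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population standard deviation (divides by n)
    if sigma == 0:
        raise ValueError("a constant sample cannot be standardized")
    return [(x - mu) / sigma for x in values]


# Hypothetical measurements, chosen only for illustration.
lengths = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(lengths)
print(z)
```

The standardized sample has zero mean and unit standard deviation, whatever the units of the original measurements.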

The covariance measures the linear association between variables $\mathbf{X}$ and $\mathbf{Y}$ and is defined as

$$\operatorname{Cov}(\mathbf{X}, \mathbf{Y}) = \mathbf{E}\big[(\mathbf{X} - \mathbf{E}(\mathbf{X}))(\mathbf{Y} - \mathbf{E}(\mathbf{Y}))\big] = \mathbf{E}(\mathbf{X}\mathbf{Y}) - \mathbf{E}(\mathbf{X})\,\mathbf{E}(\mathbf{Y}) \tag{9.36}$$

explicitly, as

$$\operatorname{Cov}(\mathbf{X}, \mathbf{Y}) = \frac{1}{n}\sum_{j=1}^{n}(x_j - \mu_X)(y_j - \mu_Y). \tag{9.37}$$
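The explicit sum (9.37) can be checked numerically against the form $\mathbf{E}(\mathbf{X}\mathbf{Y}) - \mathbf{E}(\mathbf{X})\mathbf{E}(\mathbf{Y})$. The sketch below is a minimal one under stated assumptions: the function names and the paired data are hypothetical, and the means $\mu_X$, $\mu_Y$ are taken to be the sample means.

```python
from statistics import fmean


def covariance(xs, ys):
    """Population covariance: mean product of deviations, as in eq. (9.37)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    mu_x, mu_y = fmean(xs), fmean(ys)
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / len(xs)


def covariance_shortcut(xs, ys):
    """Equivalent form E(XY) - E(X)E(Y)."""
    products = [x * y for x, y in zip(xs, ys)]
    return fmean(products) - fmean(xs) * fmean(ys)


# Hypothetical paired observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
print(covariance(xs, ys), covariance_shortcut(xs, ys))
```

Both functions agree up to rounding error; the first is usually preferable numerically, because the shortcut subtracts two potentially large, nearly equal numbers.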

The covariance equals zero if the variables are independent; variables with zero covariance are called uncorrelated. The correlation coefficient $\rho(\mathbf{X}, \mathbf{Y})$ is a normalized covariance:

$$\rho(\mathbf{X}, \mathbf{Y}) = \frac{\operatorname{Cov}(\mathbf{X}, \mathbf{Y})}{\sigma_X \sigma_Y}. \tag{9.38}$$
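A numerical illustration of (9.38) may help. In the sketch below (hypothetical function name and data, population moments assumed), an exactly linear relation gives $\rho = 1$, whereas $\mathbf{Y} = \mathbf{X}^2$ sampled symmetrically about zero gives $\rho = 0$ even though $\mathbf{Y}$ is completely determined by $\mathbf{X}$.

```python
from math import sqrt
from statistics import fmean


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by sigma_X * sigma_Y."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mu_x, mu_y = fmean(xs), fmean(ys)
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mu_x) ** 2 for x in xs) / n
    var_y = sum((y - mu_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var_x * var_y)


# Hypothetical sample, symmetric about zero.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [3.0 * x + 1.0 for x in xs]   # Y is a linear function of X
squared = [x * x for x in xs]          # Y is a nonlinear function of X

print(correlation(xs, linear))
print(correlation(xs, squared))
```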

The correlation coefficient is connected with the linear dependence of $\mathbf{X}$ and $\mathbf{Y}$, but can be zero even if $\mathbf{Y}$ is a function of $\mathbf{X}$. If more than two variables are involved, it is convenient to arrange the pairwise covariances in the so-called covariance matrix. The scatter matrix $S$ of $n$ samples of $m$-dimensional data is defined as

$$S = \sum_{j=1}^{n}(\mathbf{X}_j - \mathbf{E}(\mathbf{X}))(\mathbf{X}_j - \mathbf{E}(\mathbf{X}))^{\mathrm{T}}. \tag{9.39}$$

If the variables are normally distributed, the (normalized) scatter matrix provides an

estimate of the covariance matrix.
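The scatter matrix (9.39) can be sketched in plain Python as follows. The function name and the data are hypothetical; $\mathbf{E}(\mathbf{X})$ is assumed to be estimated by the sample mean, and "normalized" is assumed to mean division by $n$.

```python
def scatter_matrix(samples):
    """Sum over samples of the outer product of deviations from the mean."""
    n = len(samples)
    m = len(samples[0])
    # Component-wise sample mean, standing in for E(X).
    mean = [sum(row[k] for row in samples) / n for k in range(m)]
    scatter = [[0.0] * m for _ in range(m)]
    for row in samples:
        dev = [row[k] - mean[k] for k in range(m)]
        for a in range(m):
            for b in range(m):
                scatter[a][b] += dev[a] * dev[b]
    return scatter


# Hypothetical data: n = 5 samples of m = 2 dimensional observations.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0], [4.0, 4.0], [5.0, 5.0]]
S = scatter_matrix(samples)

# Dividing by n gives the normalized scatter matrix.
cov_estimate = [[entry / len(samples) for entry in row] for row in S]
print(S)
print(cov_estimate)
```

The result is a symmetric $m \times m$ matrix whose diagonal holds the summed squared deviations of each component and whose off-diagonal entries hold the summed cross products.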

Problem. Calculate the means and variances of the binomial and Poisson distributions.

9.3.1 Runs

Studies of the statistical properties of DNA and similar sequences often start by stating the total number of each of the four bases A, C, T, and G. These totals entirely neglect the order in which the bases occur. The theory of the distribution of runs is one way of handling information about order. A run is defined as a succession of similar events preceded and succeeded by different events; the number of elements in a run will be referred to as its length. The number of runs is fixed by the number of unlike neighbours: in a linear sequence it is one more than the number of unlike adjacent pairs.
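The definition of a run can be made concrete with a short sketch. The function name `runs` and the sequence are hypothetical; the sequence is treated as linear, so the first and last runs are bounded by the ends of the sequence.

```python
from collections import Counter
from itertools import groupby


def runs(sequence):
    """Return (symbol, length) for each maximal run of identical symbols."""
    return [(symbol, sum(1 for _ in group)) for symbol, group in groupby(sequence)]


# Hypothetical DNA fragment.
seq = "AAGGGCTTTTA"

base_counts = Counter(seq)   # base totals only: order is lost
run_list = runs(seq)         # order-dependent description
unlike_pairs = sum(1 for a, b in zip(seq, seq[1:]) if a != b)

print(dict(base_counts))
print(run_list)
print(len(run_list), unlike_pairs)
```

Two sequences with identical base counts can have very different run structures, which is why the run distribution carries information that the totals do not. For this linear sequence the run count is one more than the number of unlike adjacent pairs.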